Lab One - Exploring Table Data¶

  • Group: Lab One 3
    • Salissa Hernandez
    • Juan Carlos Dominguez
    • Leonardo Piedrahita
    • Brice Danvide

1. Business Understanding¶

  • The Hotel Bookings Dataset contains a comprehensive collection of information pertaining to hotel bookings, cancellations, and guests' details for Resort and City Hotels. From the dataset's description, "it provides insights into various aspects such as the type of hotel, the number of adults, children, and babies per booking, the length of stay, the meal plan chosen by guests," and other features. The travel industry is booming following the COVID restrictions that were in place over the past few years. However, cancellations are inevitable and can negatively impact hotel operations. The provided data in this dataset can facilitate analysis by examining the factors that are likely to lead to a cancellation, such as long lead times, the average daily rate, or even past cancellation behaviors. This dataset aims to identify these trends to predict whether a cancellation will occur or not.

  • The dataset contains 119,390 rows and 32 features. The target feature is is_canceled, which indicates whether a reservation was canceled or not. Although the methods for acquiring this dataset are not explicitly stated, it is reasonable to infer that it was collected by hotel management systems or third-party booking platforms. The purpose for its collection was likely to assess the factors and trends leading to cancellations in reservations. Cancellations can financially impact hotel operations, especially if they occur last minute, after hotels have already allocated their budgets.

  • The prediction task for this dataset is to predict whether a cancellation will happen or not based on trends or patterns identified in the data. For example, if city hotels experience more cancellations than resort hotels, it could indicate guests value resort hotels more than city hotels. Accurate predictions would be beneficial to various stakeholders:

    • Hotel Managers: They could use the model to minimize revenue loss and optimize operations by adjusting pricing or offering last-minute deals for canceled rooms.
    • Marketing Teams: Insights from cancellation trends could help marketing departments design campaigns that attract guests less likely to cancel, or improve service offerings to reduce cancellation rates.
    • Local Businesses: Hotels often operate near popular destinations (e.g., Disneyland or downtown areas), so reducing cancellations ensures more tourists stay, benefiting nearby businesses.

Measure of Success¶

To determine whether the prediction algorithm is successful, several criteria should be considered:

  • Accuracy: The model should perform significantly better than a naive baseline. Since roughly 37% of bookings in the dataset are canceled, always predicting "not canceled" would already be correct about 63% of the time, so the model must clear that bar, not just the 50% of a coin flip. A success rate of over 90% would indicate that the model is effective in predicting cancellations, ensuring more reliable business decisions, and would help hotels confidently adjust their overbooking strategies, dynamically price rooms, and increase operational efficiency.

  • Precision and Recall:

    • Precision: It is crucial to avoid incorrect predictions of cancellations (false positives) as these could lead to overbooking and customer dissatisfaction.
    • Recall: Ensuring that most cancellations are correctly predicted (high recall) minimizes unexpected losses due to no-shows.
    • A balanced F1 score (harmonic mean of precision and recall) would indicate an optimal trade-off between avoiding false positives and false negatives, making the model more practical for real-world use.
  • Practical Impact: The model needs to be robust and flexible to handle real-world variability. For example, it should accommodate fluctuating booking patterns during peak seasons and varying customer behavior. Additionally, a successful model should help hotels maintain occupancy rates close to industry benchmarks (typically around 70%-80%) by dynamically adjusting prices for canceled rooms or offering last-minute deals to minimize revenue loss.
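The metric definitions above can be sketched in plain Python on hypothetical labels and predictions (these toy values are illustrative, not output from any model in this notebook):

```python
# A minimal sketch of the success criteria above, computed by hand on
# hypothetical labels/predictions (1 = canceled, 0 = not canceled).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correctly predicted cancellations
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms -> needless overbooking
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed cancellations -> surprise no-shows

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)                    # of predicted cancellations, how many were real
recall = tp / (tp + fn)                       # of real cancellations, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(accuracy, precision, recall, f1)        # 0.75 0.75 0.75 0.75
```

These hand-rolled values match what `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) would report for the same arrays.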

Real-World Examples and Benchmarks¶

  • Overbooking Strategies: Many hotels rely on predictive models to allow for overbooking, ensuring occupancy rates near 100%. Accurately predicting cancellations enables hotels to avoid losing revenue from empty rooms while preventing too many customers from arriving with insufficient room availability.

  • Revenue Management: Accurate cancellation predictions can lead to a 10-15% improvement in overall hotel revenue by optimizing pricing strategies and maximizing room availability.

  • Operational Efficiency: By predicting cancellations and adjusting for them in advance, hotels can better allocate their resources, adjust staffing levels, and reduce operational costs. For instance, if a high probability of cancellations is predicted for a particular weekend, hotel management can optimize staff scheduling and avoid overstaffing.
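As a back-of-the-envelope illustration of the overbooking idea above (all numbers hypothetical, not derived from this dataset):

```python
# A minimal sketch: choosing an overbooking level so that expected
# arrivals match room capacity. Both inputs are hypothetical.
capacity = 100   # hypothetical number of rooms
p_cancel = 0.28  # hypothetical predicted cancellation probability

# If each booking independently shows up with probability (1 - p_cancel),
# accepting b bookings yields b * (1 - p_cancel) expected arrivals.
# Solving b * (1 - p_cancel) ≈ capacity gives the overbooking level.
bookings_to_accept = int(capacity / (1 - p_cancel))
print(bookings_to_accept)  # 138
```

A real revenue-management system would also bound the risk of too many arrivals (e.g., via a binomial model), but the expected-value calculation captures why an accurate cancellation probability directly drives how far a hotel can safely overbook.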

Dataset Resources:

  • https://www.kaggle.com/datasets/thedevastator/hotel-bookings-analysis/data
  • https://data.world/mesum/hotel-bookings-dataset/workspace/file?filename=hotel_bookings.csv

Kaggle Dataset: Hotel Bookings Analysis contains data collected from a city hotel and a resort hotel, with the purpose of analyzing booking patterns and cancellations to optimize operations and improve customer satisfaction.

data.world Dataset: Hotel Bookings is similar to the Kaggle dataset, including information such as bookings and customer demographics. This set was gathered to analyze booking trends and predict cancellations to optimize operations.

2. Data Understanding¶

In [1]:
# Modules & Libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.dates import DateFormatter, MonthLocator
import matplotlib.dates as mdates
import copy
import umap.umap_ as umap
from sklearn.preprocessing import StandardScaler
from matplotlib.animation import FuncAnimation

2.1 Loading the Dataset & Defining Data Types¶

In [2]:
# Loading the dataset
path = '../Data/hotel_bookings.csv'

# Read in csv file
df = pd.read_csv(path)
df.head()
Out[2]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 01-07-15
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 01-07-15
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 02-07-15
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 02-07-15
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 03-07-15

5 rows × 32 columns

In [3]:
# Note that the describe function defaults to using only some variables
df.describe()
Out[3]:
is_canceled lead_time arrival_date_year arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies is_repeated_guest previous_cancellations previous_bookings_not_canceled booking_changes agent company days_in_waiting_list adr required_car_parking_spaces total_of_special_requests
count 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119386.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 103050.000000 6797.000000 119390.000000 119390.000000 119390.000000 119390.000000
mean 0.370416 104.011416 2016.156554 27.165173 15.798241 0.927599 2.500302 1.856403 0.103890 0.007949 0.031912 0.087118 0.137097 0.221124 86.693382 189.266735 2.321149 101.831122 0.062518 0.571363
std 0.482918 106.863097 0.707476 13.605138 8.780829 0.998613 1.908286 0.579261 0.398561 0.097436 0.175767 0.844336 1.497437 0.652306 110.774548 131.655015 17.594721 50.535790 0.245291 0.792798
min 0.000000 0.000000 2015.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 6.000000 0.000000 -6.380000 0.000000 0.000000
25% 0.000000 18.000000 2016.000000 16.000000 8.000000 0.000000 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 62.000000 0.000000 69.290000 0.000000 0.000000
50% 0.000000 69.000000 2016.000000 28.000000 16.000000 1.000000 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 14.000000 179.000000 0.000000 94.575000 0.000000 0.000000
75% 1.000000 160.000000 2017.000000 38.000000 23.000000 2.000000 3.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 229.000000 270.000000 0.000000 126.000000 0.000000 1.000000
max 1.000000 737.000000 2017.000000 53.000000 31.000000 19.000000 50.000000 55.000000 10.000000 10.000000 1.000000 26.000000 72.000000 21.000000 535.000000 543.000000 391.000000 5400.000000 8.000000 5.000000
In [4]:
# Returns the dimensions of the DataFrame as (number of rows, number of columns)
df.shape
Out[4]:
(119390, 32)
In [5]:
# Returns an index object containing the column labels of the DataFrame.
df.columns
Out[5]:
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')
In [6]:
# Provides a concise summary of the DataFrame including data types, non-null values, and memory usage.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

Attributes Kept¶

We narrowed down the dataset from 32 attributes to 11 key features that closely align with our objective and prediction task.

  • Aggregated Features:

    • num_of_guests: This combines adults, children, and babies because the total number of guests is more likely to influence cancellations than each guest type separately.
    • length_of_stay: This merges stays_in_weekend_nights and stays_in_week_nights to reflect the total duration of the stay, which is more significant than the specific days of the week.
  • Individual Attributes:

    • hotel: This captures whether the booking is for a resort or city hotel, which is important for understanding cancellation trends across different types of hotels (i.e. vacations vs. business trips).
    • is_repeated_guest, previous_cancellations, booking_changes: These provide insights into guest behavior, including past cancellations and modifications, which could predict future cancellations.
    • adr: The average daily rate paid by guests, which helps reveal if cancellations are more likely at certain price points.
    • deposit_type: Reflects whether a deposit was made, which is important in determining the likelihood of follow-through on the reservation.
    • lead_time: The amount of time between booking and arrival, a key factor in understanding guest decision patterns.
    • reservation_status_date: We wanted to investigate if particular months in the year yield higher cancellation rates.
      • This will be converted to month_year in order to extract monthly data per year.
    • customer_type: We wanted to analyze if cancellations were more common in particular types of customers.

These attributes were selected based on their relevance to predicting cancellations and their potential to provide meaningful insights into guest behaviors and booking patterns.

Attributes Collected in Data¶

Categorical:¶
  • hotel: Indicates the type of hotel (resort or city).
    • bool
    • 0=Resort Hotel, 1=City Hotel
  • is_canceled: Specifies whether the booking was canceled or not.
    • 0=not canceled, 1=canceled
    • target: int
  • is_repeated_guest: Indicates whether a guest is a repeated visitor.
    • 0=not repeated, 1=repeated
    • bool
  • deposit_type: Indicates the type of deposit made for the booking.
    • One-hot encoded
    • 3 values: 'No Deposit', 'Refundable', 'Non Refund'
  • month_year: Month and year at which the last reservation status was set
    • period[M] to denote monthly frequency for our analysis
  • customer_type: Indicates the type of customer making the booking, e.g., an individual (transient) guest, a group, or a contract booking.
    • object
Numerical:¶
  • lead_time: Represents the number of days between the booking date and the arrival date.
    • int
  • length_of_stay: Number of nights spent in total.
    • int
  • num_of_guests: Total number of guests under reservation.
    • int
  • previous_cancellations: Number of times the guest previously canceled their bookings.
    • int
  • booking_changes: Number of changes made to the booking.
    • int
  • adr: Represents the average daily rate (price per room) for the booking.
    • float

Attributes Dropped¶

Irrelevant:¶
  • agent, company, market_segment, and distribution_channel will likely not be directly relevant to predicting cancellations.
Redundancy:¶
  • arrival_date_year, arrival_date_month, arrival_date_week_number, and arrival_date_day_of_month all describe the same arrival date in different forms, making it redundant to keep all of them.
Low Impact:¶
  • required_car_parking_spaces, days_in_waiting_list, and total_of_special_requests do not yield enough predictive power toward our prediction task.
Data Leakage:¶
  • reservation_status leaks the outcome of the booking (cancellations), feeding our model the outcome instead of allowing it to fully predict.
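To illustrate the redundancy point, the four arrival_date_* columns can be collapsed into a single date. A sketch on a toy frame (not applied to `df` in this notebook):

```python
import pandas as pd

# Toy frame with the same arrival_date_* column names as the dataset;
# the values shown are from the first row of the raw data.
toy = pd.DataFrame({
    'arrival_date_year': [2015],
    'arrival_date_month': ['July'],
    'arrival_date_day_of_month': [1],
})

# Concatenate year, month name, and day, then parse as a single datetime.
arrival = pd.to_datetime(
    toy['arrival_date_year'].astype(str)
    + '-' + toy['arrival_date_month']
    + '-' + toy['arrival_date_day_of_month'].astype(str),
    format='%Y-%B-%d',  # %B matches the full month name, e.g. "July"
)
print(arrival.iloc[0])  # 2015-07-01 00:00:00
```

Because one datetime column recovers all four originals (week number included, via `.dt.isocalendar().week`), keeping the separate columns adds no information.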
In [7]:
# Attributes Dropped
df.drop(['arrival_date_year', 'arrival_date_month', 
         'arrival_date_week_number', 
         'arrival_date_day_of_month', 
         'agent', 
         'company', 
         'required_car_parking_spaces',
         'meal',
         'country',
         'market_segment',
         'distribution_channel',
         'previous_bookings_not_canceled',
         'reserved_room_type',
         'assigned_room_type',
         'days_in_waiting_list',
         'reservation_status',
         'total_of_special_requests'
         ],
        axis=1,
        inplace=True)
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   hotel                    119390 non-null  object 
 1   is_canceled              119390 non-null  int64  
 2   lead_time                119390 non-null  int64  
 3   stays_in_weekend_nights  119390 non-null  int64  
 4   stays_in_week_nights     119390 non-null  int64  
 5   adults                   119390 non-null  int64  
 6   children                 119386 non-null  float64
 7   babies                   119390 non-null  int64  
 8   is_repeated_guest        119390 non-null  int64  
 9   previous_cancellations   119390 non-null  int64  
 10  booking_changes          119390 non-null  int64  
 11  deposit_type             119390 non-null  object 
 12  customer_type            119390 non-null  object 
 13  adr                      119390 non-null  float64
 14  reservation_status_date  119390 non-null  object 
dtypes: float64(2), int64(9), object(4)
memory usage: 13.7+ MB

2.2 Verifying Data Quality¶

Handling Missing and Duplicated Data¶

Before proceeding with feature aggregation, it’s crucial to ensure data quality by addressing missing and duplicated values.

Missing Data:¶

  • Identified Issue: The children attribute had 4 missing values out of 119,390 entries. These missing values were likely caused by users leaving the "number of children" field blank during the booking process.
  • Analysis and Visualization: To verify this, we visualized missing data using a bar plot, which showed that only the children column had missing entries, and the extent of missingness was minimal.
  • Justification for Imputation: Given that there were only 4 missing values and that these likely represent bookings without children, we imputed the missing values with 0. Since the impact is minor and the missing data represents less than 0.01% of the total entries, this method prevents skewing the dataset without requiring more advanced imputation techniques like KNN, which would be unnecessary for such a small fraction.

Duplicated Data:¶

  • Identified Issue: After dropping the unused columns, we detected 36,256 duplicate entries out of 119,390 total (about 30% of the data).
  • Potential Causes: Duplicates may have resulted from merging different systems or accidental/intentional multiple bookings by the same guests.
  • Visualization: We confirmed the presence of duplicates by inspecting the first six entries of the dataset; the fifth and sixth entries are identical.
  • Justification for Elimination: After examining the dataset, we decided to drop all duplicates. The reason for this is to prevent over-representation of certain guests or bookings, which could introduce bias in our model. By removing the duplicate entries, we ensure that each entry represents a unique booking instance. After cleaning, we retain over 80,000 unique entries, which remains a robust dataset for analysis.
In [9]:
# Check for missing values
df.isnull().sum()
Out[9]:
hotel                      0
is_canceled                0
lead_time                  0
stays_in_weekend_nights    0
stays_in_week_nights       0
adults                     0
children                   4
babies                     0
is_repeated_guest          0
previous_cancellations     0
booking_changes            0
deposit_type               0
customer_type              0
adr                        0
reservation_status_date    0
dtype: int64
In [10]:
# Counting the missing values per col/attribute
missing_counts = df.isnull().sum()

# Set figure size
plt.figure(figsize=(10, 6))

# Plot missing values
missing_counts[missing_counts > 0].plot(kind='bar')

# Set title labels, rotate x-axis
plt.title('Missing Values Count by Column')
plt.xlabel('Column')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)

# Display
plt.show()
In [11]:
# Checking for Duplicates
df.duplicated().sum()
Out[11]:
np.int64(36256)
In [12]:
# Example: Last two entries here are duplicates!
df.head(6)
Out[12]:
hotel is_canceled lead_time stays_in_weekend_nights stays_in_week_nights adults children babies is_repeated_guest previous_cancellations booking_changes deposit_type customer_type adr reservation_status_date
0 Resort Hotel 0 342 0 0 2 0.0 0 0 0 3 No Deposit Transient 0.0 01-07-15
1 Resort Hotel 0 737 0 0 2 0.0 0 0 0 4 No Deposit Transient 0.0 01-07-15
2 Resort Hotel 0 7 0 1 1 0.0 0 0 0 0 No Deposit Transient 75.0 02-07-15
3 Resort Hotel 0 13 0 1 1 0.0 0 0 0 0 No Deposit Transient 75.0 02-07-15
4 Resort Hotel 0 14 0 2 2 0.0 0 0 0 0 No Deposit Transient 98.0 03-07-15
5 Resort Hotel 0 14 0 2 2 0.0 0 0 0 0 No Deposit Transient 98.0 03-07-15

Converting to Appropriate Data Types¶

  • Now that we've discovered missing and duplicated data, we can proceed to impute and eliminate, respectively.
  • But first, we'll make sure to convert to appropriate data types if needed!
    • We also need to aggregate a few attributes
In [13]:
# Convert hotel to boolean
df['hotel'] = df['hotel'].map({'Resort Hotel': 0, 'City Hotel': 1}).astype('bool')

# Convert is_repeated_guest to boolean
df['is_repeated_guest'] = df['is_repeated_guest'].astype('bool')

# One hot encoding deposit_type, indicating presence (1) or absence (0).
if 'deposit_type' in df.columns:
    df = pd.get_dummies(df, columns=['deposit_type'], prefix=['deposit_type'])

# Imputing missing values: fill the children attribute with 0
df['children'] = df['children'].fillna(0).astype('int') 

# Converting reservation_status_date to month_year
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], format='%d-%m-%y', errors='coerce')

# Create month_year column in monthly period format
df['month_year'] = df['reservation_status_date'].dt.to_period('M')

# Filter Dataframe
start_period = pd.Period('2015-07', freq='M')
df = df[df['month_year'] >= start_period]
In [14]:
# Aggregation of 'adults', 'children', and 'babies' into 'num_of_guests'
df['num_of_guests'] = df['adults'] + df['children'] + df['babies']

# Aggregation of 'stays_in_weekend_nights' and 'stays_in_week_nights' into 'length_of_stay'
df['length_of_stay'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']

# Dropping the attributes we aggregated
df = df.drop(columns=['adults', 'children', 'babies', 'stays_in_weekend_nights', 'stays_in_week_nights'])
In [15]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 117040 entries, 0 to 119389
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   hotel                    117040 non-null  bool          
 1   is_canceled              117040 non-null  int64         
 2   lead_time                117040 non-null  int64         
 3   is_repeated_guest        117040 non-null  bool          
 4   previous_cancellations   117040 non-null  int64         
 5   booking_changes          117040 non-null  int64         
 6   customer_type            117040 non-null  object        
 7   adr                      117040 non-null  float64       
 8   reservation_status_date  117040 non-null  datetime64[ns]
 9   deposit_type_No Deposit  117040 non-null  bool          
 10  deposit_type_Non Refund  117040 non-null  bool          
 11  deposit_type_Refundable  117040 non-null  bool          
 12  month_year               117040 non-null  period[M]     
 13  num_of_guests            117040 non-null  int64         
 14  length_of_stay           117040 non-null  int64         
dtypes: bool(5), datetime64[ns](1), float64(1), int64(6), object(1), period[M](1)
memory usage: 10.4+ MB
In [16]:
# No more null values!
df.isnull().sum()
Out[16]:
hotel                      0
is_canceled                0
lead_time                  0
is_repeated_guest          0
previous_cancellations     0
booking_changes            0
customer_type              0
adr                        0
reservation_status_date    0
deposit_type_No Deposit    0
deposit_type_Non Refund    0
deposit_type_Refundable    0
month_year                 0
num_of_guests              0
length_of_stay             0
dtype: int64
In [17]:
# Dropping Duplicates
df.drop_duplicates(inplace=True)
In [18]:
# No more duplicates!
df.duplicated().sum()
Out[18]:
np.int64(0)
In [19]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 82363 entries, 0 to 119389
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   hotel                    82363 non-null  bool          
 1   is_canceled              82363 non-null  int64         
 2   lead_time                82363 non-null  int64         
 3   is_repeated_guest        82363 non-null  bool          
 4   previous_cancellations   82363 non-null  int64         
 5   booking_changes          82363 non-null  int64         
 6   customer_type            82363 non-null  object        
 7   adr                      82363 non-null  float64       
 8   reservation_status_date  82363 non-null  datetime64[ns]
 9   deposit_type_No Deposit  82363 non-null  bool          
 10  deposit_type_Non Refund  82363 non-null  bool          
 11  deposit_type_Refundable  82363 non-null  bool          
 12  month_year               82363 non-null  period[M]     
 13  num_of_guests            82363 non-null  int64         
 14  length_of_stay           82363 non-null  int64         
dtypes: bool(5), datetime64[ns](1), float64(1), int64(6), object(1), period[M](1)
memory usage: 7.3+ MB
In [20]:
# Reset the index just so it's easier to read the rows/entries
df = df.reset_index(drop=True)
df.tail()
Out[20]:
hotel is_canceled lead_time is_repeated_guest previous_cancellations booking_changes customer_type adr reservation_status_date deposit_type_No Deposit deposit_type_Non Refund deposit_type_Refundable month_year num_of_guests length_of_stay
82358 True 0 23 False 0 0 Transient 96.14 2017-09-06 True False False 2017-09 2 7
82359 True 0 102 False 0 0 Transient 225.43 2017-09-07 True False False 2017-09 3 7
82360 True 0 34 False 0 0 Transient 157.71 2017-09-07 True False False 2017-09 2 7
82361 True 0 109 False 0 0 Transient 104.40 2017-09-07 True False False 2017-09 2 7
82362 True 0 205 False 0 0 Transient 151.20 2017-09-07 True False False 2017-09 2 9
In [21]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82363 entries, 0 to 82362
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   hotel                    82363 non-null  bool          
 1   is_canceled              82363 non-null  int64         
 2   lead_time                82363 non-null  int64         
 3   is_repeated_guest        82363 non-null  bool          
 4   previous_cancellations   82363 non-null  int64         
 5   booking_changes          82363 non-null  int64         
 6   customer_type            82363 non-null  object        
 7   adr                      82363 non-null  float64       
 8   reservation_status_date  82363 non-null  datetime64[ns]
 9   deposit_type_No Deposit  82363 non-null  bool          
 10  deposit_type_Non Refund  82363 non-null  bool          
 11  deposit_type_Refundable  82363 non-null  bool          
 12  month_year               82363 non-null  period[M]     
 13  num_of_guests            82363 non-null  int64         
 14  length_of_stay           82363 non-null  int64         
dtypes: bool(5), datetime64[ns](1), float64(1), int64(6), object(1), period[M](1)
memory usage: 6.7+ MB

One Last Thing: Checking For Outliers¶

In [22]:
df.describe()
Out[22]:
is_canceled lead_time previous_cancellations booking_changes adr reservation_status_date num_of_guests length_of_stay
count 82363.000000 82363.000000 82363.000000 82363.000000 82363.000000 82363 82363.000000 82363.000000
mean 0.280672 78.356434 0.026468 0.273120 107.326994 2016-09-07 04:31:01.826305536 2.039265 3.661280
min 0.000000 0.000000 0.000000 0.000000 -6.380000 2015-07-01 00:00:00 0.000000 0.000000
25% 0.000000 11.000000 0.000000 0.000000 72.750000 2016-03-25 00:00:00 2.000000 2.000000
50% 0.000000 48.000000 0.000000 0.000000 99.000000 2016-09-13 00:00:00 2.000000 3.000000
75% 1.000000 123.000000 0.000000 0.000000 135.000000 2017-03-08 00:00:00 2.000000 5.000000
max 1.000000 737.000000 13.000000 21.000000 5400.000000 2017-09-14 00:00:00 20.000000 69.000000
std 0.449330 84.411887 0.298818 0.736445 55.641959 NaN 0.712955 2.785946
  • There are extreme values in the following attributes: lead_time, with a max of 737 against a mean of about 78.4, and adr, with a max of 5400 against a mean of about 107.3.
    • These two can have the most significant negative impact on the analysis of predicting cancellations, as the model may become biased toward these values.
    • In addition, adr has a min value of -6.38, which looks like a data entry error unless the hotel paid someone $6.38 to reserve a room (highly unlikely).
      • In this case, we'll keep only entries with non-negative adr values and remove the rest.
  • We decided to apply the IQR method, as lead_time and adr are skewed and this method is more robust to skewed distributions.
    • lead_time (most people book within a short time, but a few book very far in advance)
    • adr (high-end bookings or promotions can cause large deviations)
In [23]:
# Resolving the negative value in adr
df = df[df['adr'] >= 0]
In [24]:
# IQR Function
def remove_outliers_iqr(df, column):
    # Calculate the first and third quartiles
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)

    # Compute the Interquartile Range (IQR)
    IQR = Q3 - Q1

    # Define lower and upper bounds for outlier detection
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Filter the DataFrame to remove outliers
    df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df
In [25]:
# Define columns for which outlier to remove
columns_to_filter = ['lead_time', 'adr']

# Iterate and apply function
for col in columns_to_filter:
    df = remove_outliers_iqr(df, col)
In [26]:
df = df.reset_index(drop=True)
In [27]:
# Extreme outliers from `lead_time` and `adr` have been removed! 
df.describe()
Out[27]:
is_canceled lead_time previous_cancellations booking_changes adr reservation_status_date num_of_guests length_of_stay
count 78017.000000 78017.000000 78017.000000 78017.000000 78017.000000 78017 78017.000000 78017.000000
mean 0.272133 71.019803 0.026751 0.263763 103.314852 2016-09-01 20:40:41.744748032 2.010357 3.598036
min 0.000000 0.000000 0.000000 0.000000 0.000000 2015-07-01 00:00:00 0.000000 0.000000
25% 0.000000 10.000000 0.000000 0.000000 72.250000 2016-03-21 00:00:00 2.000000 2.000000
50% 0.000000 45.000000 0.000000 0.000000 98.000000 2016-09-08 00:00:00 2.000000 3.000000
75% 1.000000 115.000000 0.000000 0.000000 132.000000 2017-03-02 00:00:00 2.000000 5.000000
max 1.000000 291.000000 13.000000 21.000000 230.000000 2017-09-14 00:00:00 12.000000 69.000000
std 0.445061 72.372885 0.304994 0.719954 46.218942 NaN 0.680556 2.745294
In [28]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78017 entries, 0 to 78016
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   hotel                    78017 non-null  bool          
 1   is_canceled              78017 non-null  int64         
 2   lead_time                78017 non-null  int64         
 3   is_repeated_guest        78017 non-null  bool          
 4   previous_cancellations   78017 non-null  int64         
 5   booking_changes          78017 non-null  int64         
 6   customer_type            78017 non-null  object        
 7   adr                      78017 non-null  float64       
 8   reservation_status_date  78017 non-null  datetime64[ns]
 9   deposit_type_No Deposit  78017 non-null  bool          
 10  deposit_type_Non Refund  78017 non-null  bool          
 11  deposit_type_Refundable  78017 non-null  bool          
 12  month_year               78017 non-null  period[M]     
 13  num_of_guests            78017 non-null  int64         
 14  length_of_stay           78017 non-null  int64         
dtypes: bool(5), datetime64[ns](1), float64(1), int64(6), object(1), period[M](1)
memory usage: 6.3+ MB

Data Understanding Summary¶

  • Data Types: All attributes are now represented in their appropriate data types.
  • Feature Aggregation: Aggregated attributes as needed for efficiency.
  • Missing Values: Imputed missing values to ensure completeness.
  • Duplicates: Removed duplicate entries to avoid over-representation and bias.
  • Outliers: Removed extreme outliers present in a few attributes.

3. Data Visualization¶

3.1 Visualizing Basic Feature Distributions¶

In [29]:
# Cross Tabulation Showing Cancellation Counts Among Numerical Attributes
# List of numerical columns to analyze
numerical = ['lead_time', 'previous_cancellations', 'booking_changes', 'adr', 'num_of_guests', 'length_of_stay']

# Create copy of df to not modify original data
crosstab_df = copy.deepcopy(df)
crosstab_df['is_canceled'] = df['is_canceled'].map({0: 'Not Canceled', 1: 'Canceled'})

# Num of numerical columns
L = len(numerical)

# Num of columns in grid and calculate num of rows
ncol = 3
nrow = int(np.ceil(L / ncol))

# Setup matplotlib and axis
fig, axes = plt.subplots(nrow, ncol, figsize=(15, 5 * nrow))
fig.subplots_adjust(top=0.92)
axes = axes.flatten()

# Loop over numerical columns to create stacked histograms
for i, col in enumerate(numerical):
    sns.histplot(data=crosstab_df, x=col, hue="is_canceled", multiple="stack", ax=axes[i], bins=30, palette="coolwarm")
    # Set title, labels, enable grid
    axes[i].set_title(f'Distribution of {col}', fontsize=14)
    axes[i].set_xlabel(f'{col} Value', fontsize=12)
    axes[i].set_ylabel('Frequency', fontsize=12)
    axes[i].grid(True, which='both', linestyle='--', linewidth=0.5)
    sns.despine(ax=axes[i])

# If empty subplots delete axes
for j in range(L, nrow * ncol):
    fig.delaxes(axes[j])

# Cleaner look and display
plt.tight_layout()
plt.show()
[Figure: stacked histograms of each numerical attribute, colored by cancellation status]

Analysis:¶

  • lead_time: non-cancellations have a higher frequency than cancellations at low lead times; as lead time increases, both groups gradually decline, but cancellations tend to remain at similar levels throughout higher lead times.
  • previous_cancellations: guests with no or few previous cancellations are less likely to cancel their reservations.
  • booking_changes: guests with few booking changes are less likely to cancel their reservations.
  • adr: cancellations and non-cancellations exhibit similar rates as adr increases, indicating that beyond a certain price, both categories behave similarly.
  • num_of_guests: smaller parties cancel more often than larger ones, likely because it is more flexible to cancel with fewer members in the group.
  • length_of_stay: shorter stays show fewer cancellations overall, though cancellations still occur for short visits, where guests are less committed than they are to longer stays.
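The stacked histograms show raw counts; a complementary view is the cancellation rate within each lead-time range, which makes the observation about higher lead times explicit. A minimal sketch using `pd.cut` with illustrative bin edges and a toy frame standing in for the notebook's df:

```python
import pandas as pd

# Toy bookings standing in for the notebook's df (hypothetical values)
toy = pd.DataFrame({
    'lead_time':   [2, 5, 10, 40, 60, 90, 150, 200, 250, 280],
    'is_canceled': [0, 0,  0,  0,  1,  0,   1,   1,   0,   1],
})

# Bin lead_time, then take the mean of is_canceled per bin:
# the mean of a 0/1 flag is exactly the cancellation rate in that bin
bins = [0, 30, 90, 180, 300]
rate = toy.groupby(pd.cut(toy['lead_time'], bins=bins), observed=True)['is_canceled'].mean()
print(rate)
```

On the real data, a rate that stays flat or rises across the higher bins would support the point that cancellations persist at long lead times even as total bookings thin out.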
In [30]:
# Scatterplot of Lead Time vs ADR with Cancellation Status

# Create copy of df to not modify original data and map binary cancellation w/ readable labels
scatter_copy = copy.deepcopy(df)
scatter_copy['is_canceled'] = scatter_copy['is_canceled'].map({0: 'Not Canceled', 1: 'Canceled'})

# Set figure size
plt.figure(figsize=(12, 7))

# Create scatterplot lead time vs ADR
sns.scatterplot(data=scatter_copy, x='lead_time', y='adr', hue='is_canceled', s=50, alpha=0.7, palette='pastel')

# Set title, labels, add legend
plt.title('Scatterplot of Lead Time vs ADR with Cancellation Status')
plt.xlabel('Lead Time')
plt.ylabel('Average Daily Rate (ADR)')
plt.legend(title='Cancellation Status')

# Cleaner look and display
plt.tight_layout()
plt.show()
[Figure: scatterplot of lead time vs ADR, colored by cancellation status]

Analysis:¶

  • The scatterplot reveals that most reservations that have high average daily rates and high lead times get cancelled.
  • The cancellations appear to cluster more as lead time increases and average daily rate stays around the median range. This could indicate that higher lead times are more likely to result in cancellations, which is something stakeholders can take into consideration when planning their operations.
  • The non-cancellations cluster more towards lower lead times and median average daily rates. This makes sense because the less time guests have to wait between making a reservation and arriving at the hotel, the less likely they are to cancel.
In [31]:
# Box Plot Displaying the Correlation between Length of Stay and Cancellation Status
# Set figure size
plt.figure(figsize=(10, 6))

# Create box plot to display correlation length of stay and cancellation
ax = sns.boxplot(x='is_canceled', y='length_of_stay', data=df, 
                 hue='is_canceled', palette='pastel')

# Create dictionary to map cancellation status
labels = {0: 'Not Canceled', 1: 'Canceled'}

# Set the xaxis, and assign labels
ax.set_xticks([0, 1]) 
ax.set_xticklabels([labels[0], labels[1]])

# Set titles and labels, remove the legend
plt.xlabel('Cancellation Status')
plt.ylabel('Length of Stay')
plt.title('Length of Stay by Cancellation Status')
ax.legend_.remove()

# Display
plt.show()
[Figure: box plot of length of stay by cancellation status]

Analysis:¶

  • Median Comparison: The medians for both Cancelled and Not Cancelled categories are quite similar. This means that, on average, guests who cancel their bookings stay for about the same amount of time as those who don’t.

  • Extreme Values in Not Cancelled: For guests who do not cancel, there are more extreme values in the length of stay. This suggests that guests who end up keeping their reservations tend to have a wider range of stay durations, with some staying much longer than others.

  • Consistency in Cancelled: On the other hand, the Cancelled group has fewer extreme values. This indicates that guests who cancel their bookings tend to have more consistent stay lengths. Longer stays might be more prone to cancellation, possibly due to changing plans or uncertainties.

  • Even though the medians are similar, the spread of stay lengths shows that there might be more to uncover. It would be helpful to dive deeper into the data to see if length of stay really impacts cancellation rates.
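One way to dive deeper is to quantify the spread the box plot shows with per-group quartiles. A minimal sketch with groupby and quantile (a toy frame stands in for the notebook's df):

```python
import pandas as pd

# Toy stand-in for the notebook's df (hypothetical values)
toy = pd.DataFrame({
    'is_canceled':    [0, 0, 0, 0, 1, 1, 1],
    'length_of_stay': [1, 2, 3, 14, 2, 3, 4],
})

# Quartiles of length_of_stay per cancellation status: similar medians with
# a wider 25%-75% band in one group would mirror the box plot's story
summary = toy.groupby('is_canceled')['length_of_stay'].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Comparing the interquartile range (the 0.75 column minus the 0.25 column) per group gives a single number for the spread difference discussed above.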

In [32]:
# Distribution of Average Daily Rate
# Set style of seaborn plot and context
sns.set_style("darkgrid")
sns.set_context("talk")  

# Set figure size
plt.figure(figsize=(10, 6))

# Create Histogram with KDE
sns.histplot(data=df, x='adr', kde=True, bins=20, color='skyblue', line_kws={'color': 'darkblue', 'lw': 3})

# Set title, labels, grid lines
plt.title('Distribution of Average Daily Rate', fontsize=16, weight='bold')
plt.xlabel('Average Daily Rate', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.7)

# Display
plt.show()

# Statistics
df['adr'].describe()
[Figure: histogram with KDE of average daily rate]
Out[32]:
count    78017.000000
mean       103.314852
std         46.218942
min          0.000000
25%         72.250000
50%         98.000000
75%        132.000000
max        230.000000
Name: adr, dtype: float64

Distribution Insights:¶

  • Average Daily Rate Overview:

    • The average daily rate (ADR) of hotel stays ranges from a minimum of 0 to a maximum of 230.
    • The mean ADR is approximately $103.31.
    • The median ADR, at $98.00, is close to the mean, suggesting a relatively symmetric distribution of ADR values around the center.
  • Variability in ADR:

    • The standard deviation of 46.22 reveals a significant spread around the average ADR.
    • This suggests that there is considerable variability in the prices of hotel rooms.
    • Some guests are paying much more or less than the average rate, reflecting a range of pricing strategies or room types.
  • Percentile Distribution:

    • The 25th percentile at 72.25 indicates that 25% of the ADR values are below 72.25.
    • The 75th percentile at 132.00 shows that 75% of the ADR values are below 132.00.
  • Histogram and Kernel Density Estimate (KDE):

    • The peak of the histogram and KDE curve suggests that most ADR values are concentrated around $100, with a gradual decline in frequency for higher and lower ADR values.
  • Observations from the Plot:

    • The distribution shows a skew towards higher ADR values, with a notable frequency of ADRs around the $100 mark.
    • There is a long tail on the higher end, indicating that while most ADRs are clustered around the middle range, there are some high-value outliers.
    • We decided to keep those high-value outliers, as they likely come from rooms with more features and amenities.
In [33]:
# Correlation Heatmap of Attributes
# Create copy of df to not modify original data
df_temp = copy.deepcopy(df)

# Converting boolean columns to integers in the temporary DataFrame
bool_cols = ['hotel', 'is_repeated_guest', 'deposit_type_No Deposit', 'deposit_type_Non Refund', 'deposit_type_Refundable']
df_temp[bool_cols] = df_temp[bool_cols].astype(int)

# Select column for correlation analysis, and compute
correlation_cols = df_temp[['is_canceled', 'lead_time', 'previous_cancellations', 'booking_changes', 'adr', 'num_of_guests', 'length_of_stay'] + bool_cols]
correlation_matrix = correlation_cols.corr()

# Set figure size
plt.figure(figsize=(12, 10))

# Create heatmap
sns.heatmap(correlation_matrix, annot=True, cmap="viridis", linewidths=0.5, fmt=".2f")

# Set title and display
plt.title('Correlation Heatmap of Attributes')
plt.show()
[Figure: correlation heatmap of attributes]
Analysis:¶
  • General Trends:

    • Average Daily Rate (adr): The positive correlation with both lead_time (0.067) and num_of_guests (0.417) suggests that as the average daily rate increases, bookings tend to have longer lead times and more guests. This can indicate that higher rates might be associated with more advanced planning and potentially larger groups, both of which could impact cancellation behavior. Higher rates could also mean higher stakes, possibly leading to a greater chance of cancellation if plans change.
  • Low Correlations:

    • Booking Cancellations (is_canceled): The weak correlations with most features suggest that cancellations are influenced by multiple factors, but not strongly by any single one. The stronger correlations with lead_time (0.174) and adr (0.125) indicate that cancellations might be somewhat more likely with longer lead times and higher rates. This aligns with intuition, as guests may have second thoughts or face changing circumstances over a longer period, or higher rates might lead to reconsideration or financial constraints.
  • Significant Negative Correlations:

    • Deposit Types:
      • deposit_type_No Deposit and deposit_type_Non Refund: The strong negative correlation (-0.955) indicates that bookings with no deposit are rarely non-refundable. This can be useful for predicting cancellations because if a booking has no deposit, it might be more flexible and therefore more likely to be canceled.
      • deposit_type_No Deposit and deposit_type_Refundable: The negative correlation (-0.292) shows that bookings with no deposit are less likely to be refundable. This might affect cancellation behavior because non-refundable deposits can deter cancellations due to the financial penalty, while no deposit options might lead to higher cancellation rates.
  • Moderate Correlations:

    • Lead Time and Length of Stay: The moderate positive correlation (0.333) between lead_time and length_of_stay suggests that bookings with longer lead times tend to have longer stays. This relationship can help predict cancellations because longer bookings with extended lead times might be more vulnerable to changes in plans or unexpected events. Thus, longer lead times could potentially be associated with higher cancellation rates.
    • Number of Guests and Length of Stay: The small positive correlation (0.095) between num_of_guests and length_of_stay indicates that the number of guests has a minor influence on the duration of their stay. This relationship is less significant, but it could still be relevant for understanding cancellation trends if larger groups are more likely to cancel or adjust plans.
  • Low Impact Features:

    • Previous Cancellations and Booking Changes: Both features having low correlations with other attributes suggest they have less impact on cancellation predictions. However, analyzing these features individually might still provide insights into patterns of cancellations, especially if past behavior or frequent changes in bookings are indicators of a higher likelihood of cancellation.
  • Hotel Type and Repeated Guests:

    • Correlation Insights: The weak correlations of hotel and is_repeated_guest with other features suggest that the type of hotel and whether a guest is a repeat customer don’t strongly affect other attributes. However, understanding these factors could still be useful in a broader context. For instance, repeat guests might have different cancellation behaviors compared to first-time guests, or certain types of hotels might have policies or pricing structures that influence cancellation rates.
  • To leverage these correlations for predicting cancellations, we can focus on adr and lead_time since these features show some correlation with cancellations, and can be key predictors.

In [34]:
# Pie Chart Displaying Cancellation Rates for Each Deposit Type
# Calculating cancellations for each deposit type directly
cancellation_no_deposit = df[(df['deposit_type_No Deposit'] == True) & (df['is_canceled'] == 1)].shape[0]
cancellation_non_refund = df[(df['deposit_type_Non Refund'] == True) & (df['is_canceled'] == 1)].shape[0]
cancellation_refundable = df[(df['deposit_type_Refundable'] == True) & (df['is_canceled'] == 1)].shape[0]

# Cancellation counts for each deposit type as a labeled Series
cancellation_counts = pd.Series({
    'No Deposit': cancellation_no_deposit,
    'Non Refund': cancellation_non_refund,
    'Refundable': cancellation_refundable
})

# Calculating percentages
total_cancellations = cancellation_counts.sum()
percentages = (cancellation_counts / total_cancellations) * 100

# Creating labels for the legend with percentages
legend_labels = [f'{label} - {percent:.1f}%' for label, percent in zip(cancellation_counts.index, percentages)]
colors = ['#ff9999', '#66b3ff', '#99ff99']

# Creating the pie chart without labels and percentages on the pie itself
plt.figure(figsize=(10, 7))
patches, _ = plt.pie(cancellation_counts, startangle=90, colors=colors, wedgeprops={'edgecolor': 'black'}, labels=None)

# Adjusting legend to include percentages
plt.legend(patches, legend_labels, loc='center left', bbox_to_anchor=(1, 0.5), fontsize=8, title='Deposit Type')
plt.title('Cancellations by Deposit Type')
plt.show()

# Displaying the number of cancellations for each deposit type
print('Number of Cancellations by Deposit Type:')
cancellation_counts
[Figure: pie chart of cancellations by deposit type]
Number of Cancellations by Deposit Type:
Out[34]:
No Deposit    20554
Non Refund      654
Refundable       23
dtype: int64

Analysis¶

  • No Deposit:

    • Higher Cancellation Rates: Most common deposit type; likely associated with higher cancellations due to lower guest commitment.
    • Potential Indicator: Frequent cancellations suggest that requiring a deposit could improve guest commitment and reduce cancellations.
  • Non-Refundable:

    • Lower Cancellation Rates: Higher commitment due to financial penalty, resulting in fewer cancellations.
    • Predictive Value: Lower cancellations indicate that guests are more committed when a financial penalty is involved.
  • Refundable:

    • Fewest Cancellations: Offers flexibility but can still result in cancellations if guests are uncertain.
    • Predictive Insights: The low number of refundable bookings and low cancellation count suggest that while flexibility is valued, it doesn't guarantee commitment.
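Since No Deposit dominates total bookings, its raw cancellation count in the pie chart is expected to be largest regardless of behavior. Normalizing by the number of bookings per deposit type separates volume from commitment; a minimal sketch over one-hot columns, with a toy frame standing in for the notebook's df:

```python
import pandas as pd

# Toy stand-in mirroring the notebook's one-hot deposit columns (hypothetical)
toy = pd.DataFrame({
    'deposit_type_No Deposit': [True, True, True, False, False],
    'deposit_type_Non Refund': [False, False, False, True, True],
    'deposit_type_Refundable': [False, False, False, False, False],
    'is_canceled':             [1, 0, 0, 1, 1],
})

# For each one-hot column, the mean of is_canceled over rows where the flag
# is True is that deposit type's cancellation rate (cancellations / bookings)
for col in ['deposit_type_No Deposit', 'deposit_type_Non Refund', 'deposit_type_Refundable']:
    subset = toy.loc[toy[col], 'is_canceled']
    rate = subset.mean() if len(subset) else float('nan')
    print(col, rate)
```

On the real data this would show whether No Deposit bookings truly cancel at a higher rate, or merely in higher numbers.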

Question 1: What is the impact of repeat guest status on cancellation rates across different hotel types?¶

In [35]:
# Cancellation Rates by Repeat Status in Each Hotel Type
# Group data by repeated guest status/hotel type and calculate mean
cancellation_rate = df.groupby(['is_repeated_guest', 'hotel'])['is_canceled'].mean().reset_index()

# Mapping boolean values to appropriate labels
cancellation_rate['is_repeated_guest'] = cancellation_rate['is_repeated_guest'].map({True: 'Repeated', False: 'Not Repeated'})
cancellation_rate['hotel'] = cancellation_rate['hotel'].map({True: 'City Hotel', False: 'Resort Hotel'})

# Setting plot style and context with custom colors
sns.set_style("darkgrid")
sns.set_context("talk")
colors = ['#66c2a5', '#fc8d62']  

# Create a bar plot to visualize cancellation rate
plt.figure(figsize=(10, 6))
sns.barplot(x='is_repeated_guest', y='is_canceled', hue='hotel', data=cancellation_rate, palette=colors)

# Set title and labels
plt.title('Cancellation Rate by Repeated Guest Status and Hotel Type', fontsize=16, weight='bold')
plt.xlabel('Repeated Guest Status', fontsize=14)
plt.ylabel('Cancellation Rate', fontsize=14)

# Customize legend, add grid for readability
plt.legend(title='Hotel Type', fontsize=12, title_fontsize='13')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Output cancellation rate
cancellation_rate
[Figure: bar plot of cancellation rate by repeated guest status and hotel type]
Out[35]:
is_repeated_guest hotel is_canceled
0 Not Repeated Resort Hotel 0.229029
1 Not Repeated City Hotel 0.311461
2 Repeated Resort Hotel 0.037760
3 Repeated City Hotel 0.105911

Non-Repeated Guests Cancellation Rate¶

  • City Hotel: Non-repeated guests have a cancellation rate of ~32%.

  • Resort Hotel: Non-repeated guests have a cancellation rate of ~23%.

  • Non-repeated guests tend to cancel more often, especially in city hotels. This could indicate that first-time visitors to city hotels are more likely to have unpredictable plans or different expectations, leading to cancellations. Resort hotels seem to fare slightly better with non-repeated guests in terms of cancellations.

Repeated Guests Cancellation Rate¶

  • City Hotel: Repeated guests have a cancellation rate of ~11%.

  • Resort Hotel: Repeated guests have a much lower cancellation rate of ~4%.

  • Guests who have stayed before tend to cancel less frequently, especially at resort hotels, where the cancellation rate is very low. This suggests that resort hotels are successful at retaining loyal guests who follow through with their reservations, possibly through rewards programs or exceptional service.

City Hotels Experience Higher Cancellation Rates¶

  • Across both guest types (repeated and non-repeated), city hotels face higher cancellation rates.
  • This trend might reflect the different nature of trips for city hotels, where plans are possibly more subject to change, compared to resort stays that are likely more planned and leisure-oriented.

Question 2: How do cancellation rates change over time, and are there noticeable trends or patterns in cancellations for each type of hotel?¶

In [36]:
# Relationship of month_year and cancellations across the two types of hotels
monthly_cancellations = df.groupby(['month_year', 'hotel'])['is_canceled'].mean().reset_index()

# Converting 'month_year' to string format for plotting
monthly_cancellations['month_year'] = monthly_cancellations['month_year'].astype(str)
monthly_cancellations['hotel'] = monthly_cancellations['hotel'].map({True: 'City Hotel', False: 'Resort Hotel'})

# Create line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_cancellations, x='month_year', y='is_canceled', hue='hotel', marker='o')

# Rotate x-axis, add labels, and title
plt.xticks(rotation=45)
plt.xlabel('Month-Year')
plt.ylabel('Average Cancellation Rate')
plt.title('Monthly Average Cancellation Rate by Hotel Type')

# Customize legend
plt.legend(title='Hotel Type')

# Clean and display
plt.tight_layout()
plt.show()
[Figure: line plot of monthly average cancellation rate by hotel type]

Analysis of Monthly Cancellation Rates¶

Trends Over Time:¶
  • The cancellation rates fluctuate over time for both hotels. They tend to be higher during the first few months of the year, gradually decrease, and pick back up towards the end of the year. One explanation is that these months coincide with the holidays and with severe storms that disrupt travel, leaving many guests to cancel their travel plans and reservations. More cancellations are noted for city hotels during these periods, which could indicate that guests are more likely to cancel reservations for city hotels than for resort hotels.

  • An interesting trend to note is that cancellations begin to decrease for both hotels at the beginning of 2017. Several factors could be at play: hotels might have updated their cancellation policies or required deposits to reduce cancellations, or 2017 may have had an unusual pattern in demand or seasonality (especially since it had the lowest rates during the summer months).

  • The implication of this trend is that hotels should prepare for cancellations during the holiday seasons, when external or internal factors prevent guests from going through with their reservations. The summer season, when children are out of school, tends to have fewer cancellations, which could imply that family groups are vacationing and thus less likely to cancel, as is apparent in the graph. Therefore, we can expect cancellations to occur more often during the last and first few months of the year.

Question 3: Which customer type is more likely to cancel?¶

In [37]:
def cancellation_by_customer_type(df):
    """
    Function to calculate the cancellation rate by customer type.
    
    Args:
    df (pandas.DataFrame): DataFrame containing the columns 'customer_type' and 'is_canceled'.
    
    Returns:
    pandas.DataFrame: A DataFrame with customer types and their corresponding cancellation rates.
    """
    
    # Group by customer_type and calculate the cancellation rate
    cancellation_rate = df.groupby('customer_type')['is_canceled'].mean().reset_index()
    
    # Rename the columns for better readability
    cancellation_rate.columns = ['customer_type', 'cancellation_rate']
    
    # Convert cancellation rate to percentage for easier interpretation
    cancellation_rate['cancellation_rate'] = cancellation_rate['cancellation_rate'] * 100
    
    return cancellation_rate


result = cancellation_by_customer_type(df)
print(result)
     customer_type  cancellation_rate
0         Contract          15.224417
1            Group           6.490872
2        Transient          29.133241
3  Transient-Party          17.274130
In [38]:
def plot_violin_cancellation_by_customer_type(df):
    """
    Function to plot a violin graph showing the distribution of cancellations by customer type.
    
    Args:
    df (pandas.DataFrame): DataFrame containing the columns 'customer_type' and 'is_canceled'.
    """
    # Set up the plot size and style
    plt.figure(figsize=(10, 6))
    sns.set(style="whitegrid")

    # Create the violin plot
    sns.violinplot(x='customer_type', y='is_canceled', data=df, inner="point", hue='customer_type', palette="muted", legend=False)
    
    # Add title and labels
    plt.title('Distribution of Cancellations by Customer Type', fontsize=16)
    plt.xlabel('Customer Type', fontsize=12)
    plt.ylabel('Cancellation (0 = No, 1 = Yes)', fontsize=12)
    
    # Show the plot
    plt.show()

plot_violin_cancellation_by_customer_type(df)
[Figure: violin plot of cancellations by customer type]

Analysis¶

Cancellation Distributions by Customer Type¶

  • Broader violin shapes indicate variability in the likelihood of cancellations, unpredictability, and diverse reasons for cancelling. Transient customers have the highest cancellation rate, while Group customers have a markedly lower one. This could be because coordinating many people in a group makes a booking harder to abandon, keeping the cancellation rate significantly lower than its peers.

  • The implication we can infer from this data is that hotels should focus on offering families holiday packages to encourage them to travel in groups and maintain a collective, fixed agenda, which presses these customers to commit to the vacation and their booking as a whole.

Predictive value: tailoring vacation trips for families with good incentives will help ensure business profitability and expansion.

4. Additional Analysis¶

UMAP Dimensionality Reduction Methods¶

  • UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis (https://umap-learn.readthedocs.io/en/latest/how_umap_works.html).
    • It provides a general framework for approaching manifold learning and dimension reduction, as well as providing specific concrete realizations.
    • Essentially, it aims to uncover the underlying structure of data by assuming that high-dimensional data lies on a lower-dimensional manifold.
    • UMAP leverages ideas from topological data analysis to preserve the shape of the data and ensure that similar points remain close together.
    • The key steps in UMAP are to construct a high-dimensional graph, create an analogous low-dimensional graph, and optimize the low-dimensional layout to match the high-dimensional structure.
In [39]:
# UMAP
# Integer columns for UMAP
int_columns = ['is_canceled', 'lead_time', 'previous_cancellations', 'booking_changes', 'num_of_guests', 'length_of_stay']
data = df[int_columns]

# Standardizing the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Initializing UMAP model
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)

# Applying UMAP
embedding = reducer.fit_transform(data_scaled)

# Set figure size
plt.figure(figsize=(10, 7))

# Create scatterplot of UMAP embedding, add colorbar
plt.scatter(embedding[:, 0], embedding[:, 1], c=df['is_canceled'], cmap='Spectral', s=5, alpha=0.5)
plt.colorbar(label='Is Canceled')

# Set titles, labels
plt.title('UMAP Projection of Hotel Reservation Data')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')

# Display
plt.show()
[Figure: UMAP projection scatterplot colored by cancellation status]

UMAP allows us to identify groups of bookings with similar characteristics and see clear patterns of cancellation thanks to the visual dimensionality reduction, which collapses multiple features down to 2 dimensions. Clusters of bookings exhibiting high cancellation rates (i.e., bookings with long lead times and several previous cancellations aggregating together) help tailor policies like tiered cancellation fees based on lead time or more flexible modification options to decrease cancellations.

In [40]:
# UMAP
int_columns = ['is_canceled', 'lead_time', 'previous_cancellations', 'booking_changes', 'num_of_guests', 'length_of_stay']
data = df[int_columns]

# Standardizing the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Precomputing embeddings for each frame by slightly varying the min_dist parameter
embeddings = []
for frame in range(20):  # Reducing the number of frames to 20
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1 + 0.01 * frame, n_components=2)
    embedding = reducer.fit_transform(data_scaled)
    embeddings.append(embedding)

# Setting up the figure for animation
fig, ax = plt.subplots(figsize=(10, 7))

# Initial scatter plot
sc = ax.scatter(embeddings[0][:, 0], embeddings[0][:, 1], c=df['is_canceled'], cmap='Spectral', s=5, alpha=0.5)
plt.colorbar(sc, label='Is Canceled')
plt.title('UMAP Projection of Hotel Reservation Data')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')

# Function to update the plot for each frame
def update(frame):
    ax.clear()
    
    # Update scatter plot with precomputed embeddings
    embedding = embeddings[frame]
    sc = ax.scatter(embedding[:, 0], embedding[:, 1], c=df['is_canceled'], cmap='Spectral', s=5, alpha=0.5)
    plt.title(f'UMAP Projection of Hotel Reservation Data (Frame {frame})')
    plt.xlabel('UMAP Dimension 1')
    plt.ylabel('UMAP Dimension 2')
    
    return sc,

# Creating the animation
ani = FuncAnimation(fig, update, frames=20, interval=300)

# Displaying the animation in the notebook
from IPython.display import HTML
HTML(ani.to_jshtml())
# Saving as GIF 
# ani.save('umap_animation.gif', writer='imagemagick')
c:\Users\Juan Dominguez\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\manifold\_spectral_embedding.py:329: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn(
c:\Users\Juan Dominguez\AppData\Local\Programs\Python\Python312\Lib\site-packages\umap\spectral.py:550: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
Out[40]:
[Figure: animated UMAP projection of hotel reservation data]

The animation shows how the UMAP embedding evolves as the min_dist parameter varies: relationships between data points go from tight and local (small neighborhoods of similar bookings) to more spread out as min_dist increases, revealing a more global structure of broader relationships between groups of bookings.